Texture Segmentation Based Video Compression Using Convolutional Neural Networks
Abstract
There has been growing interest in using different approaches to improve the coding efficiency of modern video codecs in recent years as demand for web-based video consumption increases. In this paper, we propose a model-based approach that uses texture analysis/synthesis to reconstruct blocks in texture regions of a video to achieve potential coding gains using the AV1 codec developed by the Alliance for Open Media (AOM). The proposed method uses convolutional neural networks to extract texture regions in a frame, which are then reconstructed using a global motion model. Our preliminary results show an increase in coding efficiency while maintaining satisfactory visual quality.

Introduction

With the increasing amount of video being created and consumed, better video compression tools are needed to provide fast transmission and high visual quality. Modern video coding standards exploit spatial and temporal redundancy in videos to achieve high coding efficiency and high visual quality using motion compensation techniques and 2-D orthogonal transforms. However, efficient exploitation of statistical dependencies measured by mean squared error (MSE) does not always produce the best psychovisual result, and may require a higher data rate to preserve detail in the video. Recent advances in GPU computing have enabled the analysis of large-scale data using deep learning methods. Deep learning techniques have shown promising performance in many applications, such as object detection, natural language processing, and synthetic image generation [1, 2, 3, 4]. Several methods have been developed to improve coding efficiency in video applications using deep learning. In [5], sample adaptive offset (SAO) is replaced by a CNN-based in-loop filter (IFCNN) to improve coding efficiency in HEVC.
By learning the residue between the quantized reconstructed frames obtained after the deblocking filter (DF) and the original frames, IFCNN is able to reconstruct video frames with higher quality without requiring any additional bit transmission during the coding process. Similar to [5], [6] proposes a Variable-filter-size Residue-learning CNN (VRCNN) to improve coding efficiency by replacing DF and SAO in HEVC. VRCNN is based on the concept of ARCNN [7], which was originally designed for JPEG applications. Instead of using only spatial information to train a CNN to reduce coding artifacts in HEVC, [8] proposed a spatial-temporal residue network (STResNet) as an additional in-loop filter after SAO. A rate-distortion optimization strategy is used to control the on/off switch of the proposed in-loop filter. Work has also been done on the decoder side of HEVC to improve coding efficiency. In [9], a deep CNN-based auto decoder (DCAD) is implemented in the HEVC decoder to improve the quality of decoded video. DCAD is trained to learn the residue between decoded video frames and the original video frames. By adding the predicted residue generated by DCAD to the compressed video frames, this method enhances the compressed frames to a higher quality. In summary, the above methods improve coding efficiency by enhancing the quality of reconstructed video frames. However, they require different trained models for video reconstruction at different quantization levels. We are interested in developing deep learning approaches that encode only visually relevant information and use a different coding method for "perceptually insignificant" regions in a frame, which can lead to substantial data rate reductions while maintaining visual quality.
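The residue-learning idea shared by IFCNN and DCAD can be summarized in a few lines: a model predicts the coding residue (original minus decoded) and the prediction is added back to the decoded frame. The sketch below is a toy illustration of that structure only; `predict_residue` is a hypothetical stand-in (a 3 × 3 box blur), not the trained networks from [5] or [9].

```python
import numpy as np

def predict_residue(decoded):
    """Stand-in for a trained CNN that predicts original - decoded.
    Here a 3x3 box blur plays the role of the predictor (assumption
    for illustration, not the actual network)."""
    h, w = decoded.shape
    padded = np.pad(decoded, 1, mode="edge")
    blur = sum(
        padded[i:i + h, j:j + w] for i in range(3) for j in range(3)
    ) / 9.0
    return blur - decoded  # toy residue estimate

def enhance(decoded):
    # Core idea of IFCNN/DCAD: add the predicted residue back to the
    # decoded frame; no extra bits need to be transmitted.
    return decoded + predict_residue(decoded)

decoded = np.random.rand(16, 16)
enhanced = enhance(decoded)
assert enhanced.shape == decoded.shape
```

Because the correction is computed entirely from the decoded frame, the enhancement step fits either in-loop (IFCNN) or as decoder-side post-processing (DCAD) without changing the bitstream.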
In particular, we have developed a model-based approach that improves coding efficiency by identifying texture areas in a video frame that contain detail-irrelevant information, i.e., regions in which the viewer does not perceive specific details and which can therefore be skipped or encoded at a much lower data rate. The task is then to divide a frame into "perceptually insignificant" texture regions and use a texture model for the pixels in those regions. In 1959, Schreiber and colleagues proposed a coding method that divides an image into textures and edges and used it in image coding [10]. This work was later extended using human visual system properties and statistical models to determine the texture regions [11, 12, 13]. More recently, several groups have focused on adapting perceptual-based approaches to the video coding framework [14]. In our previous work [15], we introduced a texture analyzer before encoding the input sequences to identify detail-irrelevant regions in the frame, which are classified into different texture classes. At the encoder, no inter-frame prediction is performed for these regions. Instead, the displacement of the entire texture region is modeled by just one set of motion parameters. Therefore, only the model parameters are transmitted to the decoder for reconstructing the texture regions using a texture synthesizer. Non-texture regions in the frame are coded conventionally. Since this method uses feature extraction based on a texture segmentation technique, a proper set of parameters is required to achieve accurate texture segmentation for different videos. Deep learning methods usually do not require such parameter tuning for inference. As a result, deep learning techniques can be developed to perform texture segmentation and classification for the proposed model-based video coding. A Fisher vector convolutional neural network (FV-CNN) that can produce segmentation labels for different texture classes was proposed in [16].
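Modeling the displacement of an entire texture region with one set of motion parameters can be sketched as warping the region from a reference frame. The snippet below assumes a 2 × 3 affine model with nearest-neighbour sampling; the parameterization and the function `warp_affine` are illustrative assumptions, not the synthesizer used in the paper.

```python
import numpy as np

def warp_affine(ref, params, mask):
    """Reconstruct the masked (texture) pixels by sampling the
    reference frame at affine-mapped coordinates.
    params = (a, b, tx, c, d, ty): src = (a*x + b*y + tx, c*x + d*y + ty).
    Nearest-neighbour sampling keeps the sketch short."""
    h, w = ref.shape
    a, b, tx, c, d, ty = params
    out = np.zeros_like(ref)
    ys, xs = np.nonzero(mask)
    src_x = np.clip(np.rint(a * xs + b * ys + tx), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(c * xs + d * ys + ty), 0, h - 1).astype(int)
    out[ys, xs] = ref[src_y, src_x]
    return out

ref = np.arange(64, dtype=float).reshape(8, 8)
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True               # the identified texture region
# Pure translation by one pixel in x: identity scale, tx = 1.
synth = warp_affine(ref, (1, 0, 1, 0, 1, 0), mask)
```

Only the six parameters (plus the region mask) need to reach the decoder, which is the source of the bitrate saving compared with per-block motion vectors and residuals.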
One of the advantages of FV-CNN is that the input image size is flexible and not limited by the network architecture. Instead of performing pixel-wise classification on texture regions, a texture classification CNN was described in [17]. To reduce computational expense, [17] uses a small classification network to classify image patches of size 227 × 227. A smaller network is needed to classify the smaller image patches in our case. In this paper, we propose a block-based texture segmentation method to extract texture regions in a video frame using convolutional neural networks. The block-based segmentation network classifies each 16 × 16 block in a frame as texture or non-texture. The identified texture region is then synthesized using the temporal correlations among frames. Our method was implemented using the AOM/AV1 codec. Preliminary results show significant bitrate savings while maintaining satisfactory visual quality.
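The block-based segmentation step — tiling a frame into 16 × 16 blocks and labeling each as texture or non-texture — can be sketched as follows. A variance threshold stands in for the trained CNN classifier here; `segment_blocks` and `is_texture` are hypothetical names introduced for illustration.

```python
import numpy as np

BLOCK = 16  # block size used by the proposed segmentation network

def segment_blocks(frame, classify):
    """Tile the frame into BLOCK x BLOCK blocks and label each one.
    Returns a boolean map: True = texture, False = non-texture."""
    h, w = frame.shape
    labels = np.zeros((h // BLOCK, w // BLOCK), dtype=bool)
    for by in range(h // BLOCK):
        for bx in range(w // BLOCK):
            block = frame[by * BLOCK:(by + 1) * BLOCK,
                          bx * BLOCK:(bx + 1) * BLOCK]
            labels[by, bx] = classify(block)
    return labels

# Toy classifier (assumption): call a block "texture" if it varies enough.
is_texture = lambda b: b.std() > 0.1

frame = np.zeros((64, 64))
frame[:32, :] = np.random.rand(32, 64)   # noisy top half acts as texture
labels = segment_blocks(frame, is_texture)
```

In the actual pipeline the `classify` callable would be the small CNN, and blocks labeled True would be handed to the texture synthesizer instead of conventional inter-frame coding.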
Similar resources
A multi-scale convolutional neural network for automatic cloud and cloud shadow detection from Gaofen-1 images
The reconstruction of the information contaminated by cloud and cloud shadow is an important step in pre-processing of high-resolution satellite images. The cloud and cloud shadow automatic segmentation could be the first step in the process of reconstructing the information contaminated by cloud and cloud shadow. This stage is a remarkable challenge due to the relatively inefficient performanc...
Hand Gesture Recognition from RGB-D Data using 2D and 3D Convolutional Neural Networks: a comparative study
Despite considerable advances in recognizing hand gestures from still images, there are still many challenges in the classification of hand gestures in videos. The latter comes with more challenges, including higher computational complexity and the arduous task of representing temporal features. Hand movement dynamics, represented by temporal features, have to be extracted by analyzing the total fr...
Texture segmentation with Fully Convolutional Networks
In the last decade, deep learning has contributed to advances in a wide range of computer vision tasks including texture analysis. This paper explores a new approach for texture segmentation using deep convolutional neural networks, sharing important ideas with classic filter bank based texture segmentation methods. Several methods are developed to train Fully Convolutional Networks to segment tex...
A hierarchical Convolutional Neural Network for Segmentation of Stroke Lesion in 3D Brain MRI
Introduction: Brain tumors such as glioma are among the most aggressive lesions, which result in a very short life expectancy in patients. Image segmentation is highly essential in medical image analysis with applications, particularly in clinical practices to treat brain tumors. Accurate segmentation of magnetic resonance data is crucial for diagnostic purposes, planning surgical treatments, a...
Journal: CoRR
Volume: abs/1802.02992
Pages: -
Publication date: 2018